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20 years in the making 


^ http://blog.archive.org/2016/09/19/the-internet-archive-turns-20 
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The (archived) web... 



• ... is a very valuable dataset to study the web (and the offline world) 

• Access to very diverse knowledge from various disciplines (history, politics,...) 

• The whole web at your fingertips / processable snapshots 

• Adds a temporal dimension to the web / captures dynamics 

• ... is a largely unstructured collection of data 

• Access and analysis at scale is challenging 

• Processing petabytes of data is expensive and time-consuming 

• Difficult to discover, identify, extract records and contained information 

• Potentially highly technical, complex access and parsing process 

• Low-level details that users / researchers / data scientists don't want to / can't deal with 

• Data engineering is needed before the data can be used in downstream applications / studies 
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Web data from different perspectives 
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Web Data Engineering 



• Transforming data into useful information 

• Making it usable for downstream applications 

• Search, data science, digital humanities, content analysis,... 

• Regular users, researchers, data scientists / analysts,... 

• Enabling efficient and effective access through... 

• ... infrastructures 

• ... suitable data formats 

• ... simple tools / APIs 

• ... optimized indexes 



• Technical considerations made by computer scientists 

• to help users / researchers focus on their application / study / research 

• to hide complexity / low-level details through flexible abstractions 
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(Temporal) search in web archives 
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• Challenges / considerations: 

• Documents are temporal / consist of multiple versions 

• Temporal relevance in addition to textual relevance 

• Information needs are different from traditional IR 

• Indexing (temporal) web archive data with existing IR 

[Screenshot: web archive search results for the query "obama merkel", with image results from rp-online.de, zeit.de and welt.de. Each result shows its title path, URL, a text snippet, links to the earliest and latest archived version, the number of archived versions, and site stats. Search scope: all micro ccTLDs.]

Search the whole earth web archive: all pages under any top-level domain (here: only micro ccTLDs); keywords are of mixed languages. 
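
The challenges listed on this slide (documents with multiple versions, temporal relevance in addition to textual relevance) can be illustrated with a small ranking sketch. It is not the search system shown here: the names `ArchivedDoc`, `temporal_overlap`, `score`, `rank` and the weight `alpha` are hypothetical, the example URLs are placeholders, and the textual score is assumed to come from an existing IR engine.

```python
from dataclasses import dataclass


@dataclass
class ArchivedDoc:
    url: str
    text_score: float   # textual relevance, assumed to come from an existing IR engine
    first_capture: int  # year of the earliest archived version
    last_capture: int   # year of the latest archived version


def temporal_overlap(doc, start, end):
    """Fraction of the queried years [start, end] covered by the document's captures."""
    overlap = min(doc.last_capture, end) - max(doc.first_capture, start) + 1
    return max(overlap, 0) / (end - start + 1)


def score(doc, start, end, alpha=0.5):
    """Linear mix of textual and temporal relevance (alpha is a hypothetical weight)."""
    return alpha * doc.text_score + (1 - alpha) * temporal_overlap(doc, start, end)


def rank(docs, start, end, alpha=0.5):
    """Sort documents by the combined score, best first."""
    return sorted(docs, key=lambda d: score(d, start, end, alpha), reverse=True)


docs = [
    ArchivedDoc("http://example.org/a", text_score=0.9, first_capture=2010, last_capture=2010),
    ArchivedDoc("http://example.org/b", text_score=0.8, first_capture=2013, last_capture=2015),
    ArchivedDoc("http://example.org/c", text_score=0.6, first_capture=2010, last_capture=2012),
]
ranked = rank(docs, start=2010, end=2012)
```

With a query interval of 2010 to 2012, the document archived throughout that period moves ahead of documents with a higher textual score but little or no temporal overlap.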
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Data formats 



• Example: (CDX) Attachment Format (ATT / CDXA) 

• lightweight, decoupled from data 

• lives next to the data, easily shareable, many parallel versions 

• efficient loading, integrated data validation 

• constant lookup times when loading along with records 

• universal container format for all kinds of derivations 

• can be transformed into all other exchange formats 
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CDX (Capture Index) with pointers to corresponding (W)ARC records: 

*.cdx 

com,yahoo,answers,es)/ 20060616001149 http://es.an ... 200 Y2P2LXHTCPGLNZOFAZ 
com,yahoo,answers,espanol)/ 20060617034947 http:// ... text/html 200 RMMUE3QW 
com,yahoo,answers,fr)/ 20060625153331 http://fr.an ... 200 3OLFJYPP5Y3V75OPD5 
com,yahoo,answers,hk)/ 20150819101628 https://hk.a ... 0 5CUBOU4KW75IILS5D6H6 
com,yahoo,answers,id)/ 20070629224925 http://id.an ... 200 XEXA32HHEAHWLVN52J 
com,yahoo,answers,in)/ 20060422210325 http://in.an ... 200 7LZJPKLXDVE5DG2RIO 
com,yahoo,answers,it)/ 20060618041859 http://it.an ... 200 45PAAZHDBCJY65YSBX 


*.cdx.lang_2017-18_v2.cdxa.gz 


# Language detection using 'square 
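
A CDX line like the ones on this slide can be split into its leading fields with a few lines of code. The sketch below assumes the common space-separated layout (SURT key, 14-digit timestamp, original URL, MIME type, status code, digest) and ignores any further columns. `CdxRecord`, `parse_cdx_line` and `surt_to_host` are illustrative names, not part of an existing library, and the digest in the sample line is truncated exactly as on the slides.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class CdxRecord:
    surt: str
    timestamp: datetime
    url: str
    mime: str
    status: Optional[int]  # None when the status column is not numeric (e.g. "-")
    digest: str


def parse_cdx_line(line):
    """Parse the first six whitespace-separated fields of a CDX line."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(fields)}")
    surt, ts, url, mime, status, digest = fields[:6]
    return CdxRecord(
        surt=surt,
        timestamp=datetime.strptime(ts, "%Y%m%d%H%M%S"),
        url=url,
        mime=mime,
        status=int(status) if status.isdigit() else None,
        digest=digest,
    )


def surt_to_host(surt):
    """Turn the host part of a SURT key (e.g. 'com,yahoo,answers,es)/') back into a hostname."""
    host_part = surt.split(")", 1)[0]
    return ".".join(reversed(host_part.split(",")))


sample = ("com,yahoo,answers,es)/ 20180904025943 "
          "https://es.answers.yahoo.com/ text/html 200 GG5KH5IZBH3X")
record = parse_cdx_line(sample)
host = surt_to_host(record.surt)
```

Because SURT keys put the top-level domain first, sorting CDX lines lexicographically groups all captures of a domain and its subdomains together.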
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We have the data available 



• Example: Dataset of all homepages in Global Wayback (gwb - web.archive.org) 

• Extracted from snapshot 20180911224740 

• GWB-20180911224740_homepages.cdx.gz 

• Pre-processed attachments 

• GWB-20180911224740_homepages-*.cdx.gz 

• GWB-20180911224740_homepages-*.cdx.last-success-revisit.cdxa.gz 

• GWB-20180911224740_homepages-*.cdx.last-success-revisit.lang_2017-18.cdxa.gz 

• GWB-20180911224740_homepages-*.cdx.last-success-revisit.lang_2017-18_v2.cdxa.gz 

• GWB-20180911224740_homepages-*.cdx.last-success.cdxa.gz 

• GWB-20180911224740_homepages-*.cdx.last.cdxa.gz 



# The last available capture 

Y2P2LXHTCPGLNZOFAZASQSSPN2WQGZ7W com,yahoo,answers,es)/ 20180904025943 https://es.answers.yahoo.com/ text/html 200 GG5KH5IZBH3X 
RMMUE3QW6LEGK6XSODPVSW3GAB5VUMMQ com,yahoo,answers,espanol)/ 20180905123902 https://espanol.answers.yahoo.com/ text/html 200 EA 
3OLFJYPP5Y3V75OPD57BTIHNHLPHL5IW com,yahoo,answers,fr)/ 20180904220720 https://fr.answers.yahoo.com/ text/html 200 PHFBMN4ZE5CF 
5CUBOU4KW75IILS5D6H6DR53YDHS3ZWI com,yahoo,answers,hk)/ 20180903232241 https://hk.answers.yahoo.com/ text/html 200 ELEYZG4TWCM5 
XEXA32HHEAHWLVN52JYKNIZZSVBYV3PC com,yahoo,answers,id)/ 20180903231347 https://id.answers.yahoo.com/ text/html 200 SNSCWXFNXPO5 
7LZJPKLXDVE5DG2RIOZA33N4BUPY2D3Y com,yahoo,answers,in)/ 20180906005337 http://in.answers.yahoo.com/ text/html 301 7E7XC5R5K34US 
45PAAZHDBCJY65YSBXIJEVVCHN7QCYHX com,yahoo,answers,it)/ 20180903232244 https://it.answers.yahoo.com/ text/html 200 LSSQLAY2SJY5 
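
Judging from the sample above, each attachment line starts with the digest of the CDX record it belongs to, followed by the attached value. The slides do not specify the ATT / CDXA layout any further, so the following is only a sketch under that assumption. It also assumes that attachments are stored in the same order as the CDX records, which would allow both files to be read in one parallel pass. The function name `attach` and the placeholder digests `AAA`, `BBB`, `CCC`, `ZZZ`, `YYY` are hypothetical.

```python
def attach(cdx_lines, att_lines):
    """Yield (cdx_line, attachment or None) pairs in a single aligned pass.

    Assumes both inputs share the same record order and that the digest is
    the sixth whitespace-separated field of a CDX line.
    """
    # Skip blank lines and '#' comment lines in the attachment file.
    att_iter = (l for l in att_lines if l.strip() and not l.startswith("#"))
    pending = next(att_iter, None)
    for line in cdx_lines:
        digest = line.split()[5]
        value = None
        if pending is not None:
            key, _, rest = pending.partition(" ")
            if key == digest:
                value = rest.strip()
                pending = next(att_iter, None)
        yield line, value


cdx = [
    "org,example)/ 20060101000000 http://example.org/ text/html 200 AAA",
    "org,example)/a 20060102000000 http://example.org/a text/html 200 BBB",
    "org,example)/b 20060103000000 http://example.org/b text/html 200 CCC",
]
att = [
    "# The last available capture",
    "AAA org,example)/ 20180101000000 https://example.org/ text/html 200 ZZZ",
    "CCC org,example)/b 20180102000000 https://example.org/b text/html 200 YYY",
]
joined = list(attach(cdx, att))
```

Records without an attachment (here the one with digest `BBB`) simply get `None`, so a derivation does not have to cover every record.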
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Putting it all together: Sparkling ☆ 
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(Internal) data processing library based on Apache Spark 

• Goal: integrate all APIs for working with (temporal) web data in one place 

• Continuous work in progress, growing with every new task 

Rich in features 



• Efficient CDX / (W)ARC loading, parsing and writing from / to HDFS, PetaBox,... 

• Fast HTML processing without expensive DOM parsing (SAX-like) 

• Internal PetaBox authentication / access features 

• ATT / CDXA attachment loaders and writers 

• Shell / Python integration for computing derivations 

• Distributed budget-aware repartitioning (e.g., 1GB per partition / file) 

• Advanced retry / timeout / failure handling 

• Lots of utilities for logging, file handling, string operations, URL/SURT formatting,... 

• Easily configurable, library-wide constants and settings 
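
The idea of fast HTML processing without expensive DOM parsing can be illustrated with Python's standard-library `html.parser`, which is event-based in the same SAX-like spirit. This is only a sketch of the general technique, not Sparkling's implementation; the class `TitleAndLinks` and the helper `extract` are hypothetical names.

```python
from html.parser import HTMLParser


class TitleAndLinks(HTMLParser):
    """Collects the page title and link targets from parser events; no DOM tree is built."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract(html):
    """Return (title, links) for an HTML string."""
    parser = TitleAndLinks()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links


page = ('<html><head><title> Merkel, Obama </title></head>'
        '<body><a href="http://example.org/a">a</a><a name="x">no href</a></body></html>')
title, links = extract(page)
```

Since only the events of interest are handled, memory use stays small per page, which matters when the same extraction runs over very large numbers of archived records.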
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


You can too: ArchiveSpark (3.0) 



• Same efficient data loading, access and processing approach 

• Declarative workflows, seamlessly integrates metadata / records 

• Open source 

• Available on GitHub: https://github.com/helgeho/ArchiveSpark 

• with documentation, docker image, and recipes for common tasks 


• Recently updated 

• Streamlined dependencies and package structure 

• Even more simplified API 

• Lots of bug fixes and improvements 

• Largely based on Sparkling 

• Includes large parts of it and benefits from Sparkling fixes and updates 
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We're at your service! 
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• Archive-It Research Services (ARS) 

• WAT (extended metadata files) 

• LGA (temporal graphs) 

• WANE (named entities) 

• Special Seed Services (Artificial Zone Files) 



• Language + GeoIP analysis 
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• Nation Wide Web (NWW) Search 

• Customized / regional web + media search 

• APIs 




• WASAPI data-transfer API (Archive-It) 

• Availability API + CDX Server (Wayback) 

• More to come soon, stay tuned... 
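
As a rough illustration of the Wayback APIs listed above, the sketch below builds request URLs for the Availability API and the CDX Server and reads the closest snapshot from an Availability response. The endpoints and the `url`, `timestamp`, `output` and `limit` parameters follow the public Wayback documentation as far as known here and should be checked against it. The helper names and the sample response body are illustrative, and no network request is made.

```python
import json
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def availability_url(url, timestamp=None):
    """Build an Availability API request; timestamp is an optional YYYYMMDD[hhmmss] string."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def cdx_url(url, limit=5):
    """Build a CDX Server request that returns up to `limit` captures as JSON."""
    return CDX_ENDPOINT + "?" + urlencode({"url": url, "output": "json", "limit": limit})


def closest_snapshot(response_body):
    """Return (snapshot_url, timestamp) from an Availability response, or None."""
    data = json.loads(response_body)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"], closest["timestamp"]
    return None


# Illustrative response body in the shape the Availability API returns.
sample_response = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20180904025943/https://es.answers.yahoo.com/",
            "timestamp": "20180904025943",
            "status": "200",
        }
    }
})
snapshot = closest_snapshot(sample_response)
```

The URLs can be passed to any HTTP client; keeping URL construction and response parsing separate makes both parts easy to test offline.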



[Screenshot: NWW search results page (1,139 results) for the example query, listing image results from rp-online.de, zeit.de and welt.de with title path, URL, text snippet, links to the earliest and latest archived version, number of archived versions, and site stats.] 
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Thank you! 

• archive.org 

• archive-it.org 

• github.com/helgeho/ArchiveSpark 
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Questions? 
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If interested in our work, 
please get in touch! 
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